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ABSTRACT 

Motivation: Over the last decade, a large variety of clustering algo- 
rithms have been developed to detect coregulatory relationships 
among genes from microarray gene expression data. Model based 
clustering approaches have emerged as statistically well grounded 
methods, but the properties of these algorithms when applied to 
large-scale data sets are not always well understood. An in-depth 
analysis can reveal important insights about the performance of the 
algorithm, the expected quality of the output clusters, and the possi- 
bilities for extracting more relevant information out of a particular data 
set. 

Results: We have extended an existing algorithm for model based 
clustering of genes to simultaneously cluster genes and conditi- 
ons, and used three large compendia of gene expression data for 
S. cerevisiae to analyze its properties. The algorithm uses a Baye- 
sian approach and a Gibbs sampling procedure to iteratively update 
the cluster assignment of each gene and condition. For large-scale 
data sets, the posterior distribution is strongly peaked on a limited 
number of equiprobable clusterings. A GO annotation analysis shows 
that these local maxima are all biologically equally significant, and 
that simultaneously clustering genes and conditions performs better 
than only clustering genes and assuming independent conditions. A 
collection of distinct equivalent clusterings can be summarized as a 
weighted graph on the set of genes, from which we extract fuzzy over- 
lapping clusters using a graph spectral method. The cores of these 
fuzzy clusters contain tight sets of strongly coexpressed genes, while 
the overlaps exhibit relations between genes showing only partial 
coexpression. 

Availability: GaneSh, a Java package for coclustering, is available 
under the terms of the GNU General Public License from our website 
at http://bioinformatics.psb.ugent.be/software. 
Contact: yves.vandepeer@psb.ugent.be 
Supplementary Information: available on our website at 
http://bioinformatics.psb.ugent.be/supplementary.data/anjos/gibbs 

1 INTRODUCTION 

Since the seminal paper by Eisen et al. (1998), now almost a decade 
ago, clustering forms the basis for extracting comprehensible infor- 
mation out of large-scale gene expression data sets. Clusters of 
coexpressed genes tend to be enriched for specific functional cate- 
gories (Eisen et al, 1998), share cw-regulatory sequences in their 
promoters (Tavazoie et al, 1999), or form the building blocks for 
reconstructing transcription regulatory networks (Segal etal., 2003). 
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A variety of heuristic clustering methods have been used, such 
as hierarchical clustering (Eisen et ah, 1998), fc-means (Tavazoie 
et al., 1999), or self-organizing maps (Tamayo et al., 1999). Alt- 
hough these methods have had an enormous impact, their statistical 
properties are generally not well understood and important parame- 
ters such as the number of clusters are not determined automatically. 
Therefore, there has been a shift in attention towards model based 
clustering approaches in recent years (Yeung et al, 2001; Fraley 
and Raftery, 2002; Medvedovic and Sivaganesan, 2002; Medvedo- 
vic et al, 2004; Qin, 2006; Dahl, 2006). A model based approach 
assumes that the data is generated by a mixture of probability dis- 
tributions, one for each cluster, and takes explicitly into account 
the noisyness of gene expression data. It allows for a statistical 
assessment of the resulting clusters and gives a formal estimate for 
the expected number of clusters. To infer model parameters and clu- 
ster assignments, standard statistical techniques such as Expectation 
Maximization or Gibbs sampling are used (Liu, 2002). 

In this paper we use a novel model based clustering method 
which builds upon the method recently introduced by Qin (2006). 
We address two key questions that have remained largely unans- 
wered for model based clustering methods in general, namely 
convergence of the Gibbs sampler for very large data sets, and 
non-heuristic reconstruction of gene clusters from the posterior 
probability distribution of the statistical model. 

In the model used by Qin (2006), it is assumed that the expres- 
sion levels of genes in one cluster are random samples drawn from 
a Gaussian distribution and expression levels of different experi- 
mental conditions are independent. We have extended this model to 
allow dependencies between different conditions in the same cluster. 
Medvedovic et al. (2004) used a multivariate normal distribution to 
take into account correlation among experimental conditions. Our 
approach consists of clustering the conditions within each gene clu- 
ster, assuming that the expression levels of the genes in one gene 
cluster for the conditions in one condition cluster are drawn from 
one Gaussian distribution. Hence our model is a model for coclu- 
stering or two-way clustering of genes and conditions. The same 
statistical model was also used in our recent approach to reconstruct 
transcription regulatory networks (Michoel et al., 2007). The coclu- 
stering is carried out by a Gibbs sampler which iteratively updates 
the assignment of each gene, and within each gene cluster the assi- 
gnment of each experimental condition, using the full conditional 
distributions of the model. 

It is known that a Gibbs sampler may have poor mixing proper- 
ties if the distribution being approximated is multi-modal and it will 
then have a slow convergence rate (Liu, 2002). Previous studies of 
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Gibbs samplers for model based clustering have not reported conver- 
gence difficulties (Medvedovic and Sivaganesan, 2002; Medvedovic 
et al, 2004; Dahl, 2006). In those studies, only data sets with a rela- 
tively small number of genes (upto a few 100) (Medvedovic and 
Sivaganesan, 2002; Medvedovic et al., 2004), or a small number 
of experimental conditions (less than 10) (Dahl, 2006) were consi- 
dered, and special sampling techniques such as reverse annealing 
(Medvedovic et al., 2004) or merge-split proposals (Dahl, 2006) 
were sufficient to generate a well mixing Gibbs sampler. We observe 
that for data sets of increasing size the correlation between two 
Gibbs sampler runs as well as the number of cluster solutions visited 
in one run after burn-in steadily decreases. This means that for large- 
scale data sets, the posterior distribution is very strongly peaked on 
multiple local modes. Since the peaks are so strong, we approximate 
the posterior distribution by averaging over multiple runs performed 
in parallel, each converging quickly to a single mode. By compu- 
ting the correlation between different averages of the same number 
of runs we are able to show that the number of distinct modes is 
relatively small and accurate approximations to the posterior distri- 
bution can be obtained with as few as 10 modes for around 6000 
genes. 

To identify the final optimal clustering, the traditional approach is 
to select out of all the clusterings visited by the Gibbs sampler the 
one which maximizes the posterior distribution (maximum a poste- 
riori (MAP) clustering). However, we show that for large data sets 
the differences in likelihood between the different local maxima are 
extremely small and statistically insignificant, such that the MAP 
clustering is as good as taking any local maximum at random. A GO 
(Ashburner et al., 2000) analysis of the different modes shows that 
also from the biological point of view any difference between the 
local modes is insignificant. Taking into account the full posterior 
distribution is more difficult since different clusterings may have a 
different number of clusters and the labeling of clusters is not uni- 
que (the label switching problem (Redner and Walker, 1984)). The 
common solution to this problem is to consider pairwise probabili- 
ties for two genes being clustered together or not (Medvedovic and 
Sivaganesan, 2002; Medvedovic et al, 2004; Dahl, 2006). A major 
question that has not yet recieved a final answer is how to reconstruct 
gene clusters from these pairwise probabilities. Medvedovic and 
Sivaganesan (2002) and Medvedovic et al. (2004) use a heuristic 
hierarchical clustering on the pairwise probability matrix to form 
a final clustering estimate. Dahl (2006) introduces a least-squares 
method, which selects out of all clusterings visited by the Gibbs 
sampler the one which minimizes a distance function to the pair- 
wise probability matrix. In both approaches, the probability matrix 
is reduced to a single hard clustering. This necessarily removes non- 
transitive relations between genes (such as a low probability for a 
pair of genes to be clustered together even though they both have 
relatively high probability to be clustered with the same third gene) 
which may nevertheless be informative and biologically meaningful. 

We propose that the pairwise probability matrix reflects a .soft or 
fuzzy clustering of the data, i.e., genes can belong to multiple clu- 
sters with a certain probability. To extract these fuzzy clusters from 
the pairwise probabilities we use a method from pattern recognition 
theory (Inoue and Urahama, 1999). This method iteratively com- 
putes the largest eigenvalue and corresponding eigenvector of the 
probability matrix, constructs a fuzzy cluster with the eigenvector, 
and updates the probability matrix by removing from it the weight of 
the genes assigned to the last cluster. By only keeping genes which 



belong to one fuzzy cluster with very high probability we obtain 
tight clusters which show higher functional coherence compared to 
standard clusters. Keeping also genes which belong with lower but 
still significant probability to multiple fuzzy clusters, we can ten- 
tatively identify multifunctional genes or relations between genes 
showing only partial coexpression. We show that our results are in 
good agreement with previous fuzzy clustering approaches to gene 
expression data (Gasch and Eisen, 2002). We believe that our fuzzy 
clustering method to summarize the posterior distribution will be of 
general interest for all model based clustering approaches and sol- 
ves the problems associated to heuristic clusterings of the pairwise 
probability matrix. 

All our analyses are performed on three large-scale public com- 
pendia of gene expression data for 5. cerevisiae (Spellman et al, 
1998; Gasch et al, 2000; Hughes et al, 2000). 

2 METHODS 

Mathematical model 

For an expression matrix with TV genes and M conditions, we define a coclu- 
stering as a partition of the genes into K gene clusters Qf^, together with 
for each gene cluster, a partition of the set of conditions into Lj; condition 
clusters We assume that all data points in a cocluster {(j,m): i € 

Qkt'm G £k.l} 3re random samples from the same normal distribution. 
This model generalizes the model used by Qin (2006), where the pailition of 
conditions is always fixed at the trivial partition into singleton sets. 

Given a set of means and precisions {fi^i i Tfej ), a coclustering C defines a 
probability density on data matrices T) = (xi^) by 

p(l> I C, (/ifej.Tfei)) = ]^ f][ ]^ Y[ P{Xim \ tJ-kl,rki). 

fc=i ;=i ieSfc m&Sk,i 

We use a uniform prior on the set of coclusterings with normal-gamma con- 
jugate priors for the parameters /i^.; and r^;. Using Bayes' rule we find the 
probability of a coclustering C with parameters {^ikl^'^kl) given the data 
D. Then we take the marginal probability over the parameters (/ifc;,Tfe;) 
to obtain the final probability of a coclustering C given the data T), upto a 
normalization constant: 
K 

p{c I ©) oc n n yy p[^l, r) n n i ^) 

where p(/^, t) = | t)p(t) with 

^^'^ ' ^ ^ 27r ' ' r(ao) 

qq, /3o, Ao > and — oo < /^o < oo being the parameters of the normal- 
gamma prior distribution. We use the values ao = /3o = Ao = 0.1 
and iiQ = 0.0, resulting in a non-informative prior. We have compared 
the normal-gamma prior with other non-informative, conjugate priors, but 
found no difference in results (see Supplementary Information). The double 
integral in eq. (1) can be solved exactly in terms of the sufficient stati- 
stics T^r' = EiGSfc,me£j,, ^"m = 0. 1' 2) for each cocluster The 
log-hkelihood or Bayesian score decomposes as a sum of cocluster scores: 

S(,C) = \ogp(C\V) = J2J2^'"' (2) 

fc=i 1=1 

with 

= -^T^^ log{2^) + i log( ^° ) - logrCao) 
Ao + T^i 

+ logr(ao + 5^^?') + ao log/3o - («□ + iT^°^)log/3i 
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Gibbs sampler algorithm 

We use a Gibbs sampler to sample coclusterings from the posterior distribu- 
tion (1). The algorithm iteratively updates the assignment of genes to gene 
clusters, and for each gene cluster, the assignment of conditions to condition 
clusters as follows: 

1. Initialization: randomly assign A'^ genes to a random Kg number of 
gene clusters, and for each cluster, randomly assign M conditions to a 
random Lf^ o number of condition clusters. 

2. For N cycles, remove a random gene i from its current cluster. For each 
gene cluster k, calculate the Bayesian score S'(Ci^fc), where Ci^k 
denotes the coclustering obtained from C by assigning gene i to cluster 
k, keeping all other assignments of genes and conditions equal, as well 
as the probability S(Ci—^o) for the gene to be alone in its own cluster. 
Assign gene i to one of the possible K + 1 gene clusters, where K 
is the current number of gene clusters, according to the probabilities 
Qfc oc e'^(''»^'=', normalized such that J^j, = 1. 

3. For each gene cluster k, for M cycles, remove a random condition 
m from its cuiTent cluster. For each condition cluster I, calculate the 
Bayesian score S{Ck^m^l)- Assign condition m to one of the possible 
Lj. + 1 clusters, where Lj; is the current number of condition clusters 
for gene cluster k, according to the probabilities Qi oc e'^^''''''™^'^ 
normalized such that Y2i Ql — ^■ 

4. Iterate step 2 and 3 until convergence. One iteration is defined as execu- 
ting step 2 and 3 consecutively once, and hence consists of N + K X AI 
sampling steps (with K the number of gene clusters after Step 1 of that 
iteration). 

This coclustering algorithm simulates a Markov chain which satisfies 
detailed balance with respect to the posterior distribution (1), i.e., after a suf- 
ficient number of iterations, the probability to visit a particular coclustering 
C is given exactly by p(C | T>). The expectation value of any real function / 
with respect to the posterior distribution can be approximated by averaging 
over the iterations of a sufficiently long Gibbs sampler run: 
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where Ct is the coclustering visited at iteration t and To is a possible burn- 
in period. We say that the Gibbs sampler has converged if two runs starting 
from different random initializations return the same averages (3) for a sui- 
table set of test functions /. More precisely, if {fn} is a set of test functions, 
define a„ = -Ei{/n) the average of /„ in the first Gibbs sampler run, and 
b„ = i?2{/n) the average of /„ in the second Gibbs sampler run. We define 
a correlation measure p (0 < p < 1) between two runs as 



^ En anb,i\ 

Full convergence is reached if p = 1. 



(4) 



Fuzzy clustering 

To keep track of the gene clusters, independent of the (varying) number of 
clusters or their labeling, we consider functions 

/ (C) < ^ gene i and j belong to the same gene cluster in C 
1 otherwise 

In general, the posterior distribution ( 1 ) is not concentrated on a single coclu- 
stering and the matrix F = {E(fij)) of expectation values (see eq. (3)) 
consists of probabilities between and 1. To quantify this fuzzyness, we use 



N'2 In 2 



(6) 



where TV is the dimension of the square matrix F and 

h{q) = -gln{q) - (1 - (?) ln(l - g) for < g < 1. 

For a hard clustering (Fij = or 1 for all //fuzzy = 0, and for a 
maximally fuzzy clustering (Fij = 0.5 for all i,j), //fuzzy = 1- In reality, 
the matrix F is very sparse (most gene pairs will never be clustered together), 
so //fuzzy remains small even for real fuzzy clusterings. 

We assume that a fuzzy gene-gene matrix F is produced by a fuzzy clu- 
stering of the genes, i.e., we assume that each gene i has a probability pn^ 
to belong to each cluster k, such that Y2k Pik ~ ^- '^'^ extract these proba- 
bilities from F we use a graph spectral method (Inoue and Urahama, 1999), 
originally developed for pattern recognition and image analysis, modified 
here to enforce the normalization conditions onpn^. A fuzzy cluster is repre- 
sented by a column vector w = {wi, . . . ,u!jv)"^> with Wi the weight of 



gene i in this cluster, normalized such that ||u!|| 



The cohesiveness of the cluster with respect to the gene-gene matrix F is 
defined as w'^ Fw = ■ ■ WiFijWj. By the Rayleigh-Ritz theorem. 



max — = — 



■ v"[ Fv\ 



Ai 



where Ai is the largest eigenvalue of F and v\ the corresponding (normali- 
zed) eigenvector. Hence the maximally cohesive cluster in F is given by the 
eigenvector of the largest eigenvalue. By the Perron-Frobenius theorem, this 
eigenvector is unique and all its entries are nonnegative. We can then define 
the membership probabilities to cluster 1 by pii = 



— r. Hence the 

gene with the highest weight in v\ is considered the prototypical gene for this 
cluster, and it will not belong to any other cluster. The probability mea- 
sures to what extent other genes are coexpressed with this prototypical gene. 
To find the next most cohesive cluster, we remove from F the information 
already contained in the first cluster by setting 



r(2) 



^1 - PiiFij v'l - Pji, 



and compute the largest eigenvalue and con'esponding (normalized) eigen- 
vector V2 for this matrix. The prototypical gene for this cluster may already 
have some probability assigned to the previous cluster, so we define the 
membership probabilities to the second cluster by 



Pi2 



l),1-P»i 



Vmaxj(i;2,j) 

Here 't,„ax = arg max^ {v2,j ) is the prototypical gene for the second cluster, 
and we take the 'min' to ensure that J^^, pn^ will never exceed 1. 

This procedure of reducing F and computing the largest eigenvalue and 
coiTesponding eigenvector to define the next cluster membership probabili- 
ties is iterated until one of the following stopping criteria is met: 

1. All entries in the reduced matrix /<'('') reach 0, i.e., for all genes, 
Efe'<fcPifc' = 1" ws ha.ve completely determined all fuzzy 
clusters and their membership probabilities. 

2. The largest eigenvalue of the reduced matrix /<'(*' has rank > 1. In 
this case the eigenvector is no longer unique and need no longer have 
nonnegative entries, so we carmot make new cluster membership pro- 
babilities out of it. This may happen if the (weighted) graph defined 
by connecting gene pairs with non-zero entries in /<'(*' is no longer 
strongly connected (Perron-Frobenius theorem). 

To compute one or more of the largest eigenvalues and eigenvectors for 
large sparse matrices such as F and its reductions /*'('') we use efficient 
sparse matrix routines, such as for instance implemented in the Matlab® 
function eigs. 

Data sets 

We use three large compendia of gene expression data for budding yeast: 
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1. Gasch etal. (2000) data set: expression in 173 stress related conditions. 

2. Hughes et al. (2000) data set: compendium of expression profiles 
corresponding to 300 diverse mutations and chemical treatments. 

3. Spellman et al. (1998) data set: 77 conditions for alpha factor aiTest, 
elutriation, and arrest of a cdcl5 temperature-sensitive mutant. 

We select the genes present in all three data sets (6052 genes) and, to be as 
unbiased as possible, no further postprocessing is done. We use SynTReN 
(Van den Bulcke et al, 2006) to generate simulated data sets with varying 
number of conditions for a synthetic transcription regulatory network with 
1000 genes (see also Supplementary Information). 

Functional coherence 

To estimate the overall biological relevance of the clusters we use a method 
which calculates the mutual information between clusters and GO attributes 
(Gibbons and Roth, 2002). For each GOslim attribute, we create a cluster- 
attribute contingency table where rows are clusters and columns are attribute 
status ('Yes' if the gene possesses the attribute. Wo' if it is not known 
whether the gene possesses the attribute). The total mutual information is 
defined as the sum of mutual informations between clusters and individual 
GO attributes: 

MI = ^H(C) + H{A) ~ H(C,A) (1) 

A 

where C is a clustering of the genes, A is a GO attribute and H is Shannon's 
entropy, H = — pi log(pi), and the p; are probabilities obtained from 
the contingency tables. 

3 RESULTS AND DISCUSSION 

Convergence of the Gibbs sampler algorithm 

We study convergence using the test functions fij which indicate if 
gene i and j are clustered together or not (see eq. (5) in the Methods) 
and compute the correlation measure p between different runs for 
this set of functions (see eq. (4) in the Methods). In addition to the 
correlation measure, we also compute the entropy measure -fffuzzy 
(see eq. (6) in the Methods). This parameter summarizes the 'shape' 
of the posterior distribution: a value of corresponds to hard cluste- 
ring which implies that the distribution is completely supported on a 
single solution, the more positive //fuzzy is, the more the distribution 
is supported on multiple solutions. 

In the analysis below we use subsets from the Gasch et al. data set 
with a varying number of genes and conditions and perform multiple 
Gibbs sampler runs with a large number of iterations. One iteration 
involves a reassignment of all genes and all conditions in all clusters, 
and hence involves N+M x K sampling steps in the Gibbs sampler, 
where A'^ is the number of genes, M the number of conditions, and 
K the number of clusters at that iteration (typically K ~ ^/N). 

First we consider a very small data set (100 genes, 10 conditi- 
ons). We start two Gibbs sampler runs in parallel and compute the 
correlation measure p at each iteration, see Figure 1. In this case, 
p approaches its maximum value p = 1 in less than 5000 iterati- 
ons and the Gibbs sampler generates a well mixing chain which can 
easily explore the whole space. Non-zero values of the entropy mea- 
sure i/fuzzy (0.105 ± 0.003) indicate that the posterior distribution 
is supported on multiple clusterings of the genes. 

Next we run the Gibbs sampler algorithm on a data set with 1000 
genes and all 173 conditions. Unlike in the previous situation we 
observe that the correlation between two Gibbs sampler runs satura- 
tes well below 1 (see Figure 1). Hence the Gibbs sampler does not 
converge to the posterior distribution in one run. We can gain fur- 
ther understanding for the lack of convergence by looking in more 
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Fig. 1. Trace plot of the correlation measure p between two different Gibbs 
sampler runs as a function of the number of iterations, for a small data set 
(100 genes, 10 conditions, top curve) and a large data set (1000 genes, 173 
conditions, bottom curve). Both data sets are subsets of the Gasch et al. data 
set. 

detail at a single Gibbs sampler run. It turns out that the correlation 
measure between two successive iterations reaches 1 very rapidly 
and remains unchanged afterwards (See Supplementary Figure 2). 
Since each iteration involves a large number of sampling steps {i.e., 
a large number of possible configuration changes), this implies that 
the Gibbs sampler very rapidly finds a local maximum of the poste- 
rior distribution from which it can no longer escape. We conclude 
that the posterior distribution is supported on multiple local maxima 
which overlap only partially, and with valleys in between that can- 
not be crossed by the Gibbs sampler. These local maxima all have 
approximately the same log-likelihood (see for instance the small 
variance in Figure 4 below) and are therefore all equally meaningful. 
The probability ratio between peaks and valleys is so large (expo- 
nential in the size of the data set) that an accurate approximation 
to the posterior distribution is given by averaging over the local 
maxima only. Those can be uncovered by performing multiple inde- 
pendent runs, each converging very quickly on one of the maxima, 
and there is no need for special techniques to also sample in bet- 
ween local maxima. The number of local maxima (Gibbs sampler 
runs) necessary for a good approximation can be estimated as fol- 
lows. We perform 150 independent Gibbs sampler runs and compute 
for each the pairwise gene-gene clustering probability matrix F (see 
Methods). For each k — 1, . . . , 50, we take two non-overlapping 
sets of k solutions and compute the average of their pairwise pro- 
bability matrices F. Then, we compute the correlation measure p 
between those two averages. This is repeated several times, depen- 
ding on the number of non-overlapping sets that can be chosen from 
the pool of 150 solutions. If for a given k the correlation is always 
1, then there are at most k local maxima. Figure 2 shows that as k 
increases, the correlation quickly reaches close to this perfect value 
1. This implies that the number of local maxima is not too large and 
a good approximation to the posterior distribution can be obtained 
in this case already with 10 to 20 solutions. Supplementary Figure 1 
shows an example of hard clusters formed as a result of a single run 
and fuzzy clusters formed by merging the result of 10 independent 
runs. 



4 



0.9 



0.8 



0.7 



0.6 



0.5 



0.4 



10 15 20 25 30 35 40 
Number of cluster solutions merged 



45 50 



Fig. 2. Correlation measure p between different averages of tlie same num- 
ber of local maxima for a data set of 1000 genes and 173 conditions (subset 
of the Gasch et al. data set). 



In Figure 3, we keep the same 1000 genes and select an incre- 
asing number of conditions. As the data set increases, the entropy 
measure Hfuzzy decreases, meaning the clusters become increasin- 
gly hard. Simultaneously, the correlation measure p decreases from 
about 0.85 to 0.55 (see Supplementary Figure 3). We conclude that 
the depth of the valleys between different local maxima of the poste- 
rior distribution increases with the size of the data set and it becomes 
increasingly more difficult for the Gibbs sampler to escape from 
these maxima and visit the whole space in one run. 



The rapid convergence of the log-likelihood shows that the Gibbs 
sampler reaches the local maxima very quickly and the low variance 
shows that the different local maxima are all equally likely. The 
average over 10 runs of the GO mutual information score (see eq. 
(7) in the Methods) shows the same rapid convergence and small 
variance (see Supplementary Figure 6), implying that the different 
maxima are biologically equally meaningful according to this score. 
The correlation between different averages of 10 Gibbs sampler runs 
reaches 0.85, a value we consider high enough for a good approxi- 
mation of the posterior distribution. The other two data sets show 
precisely the same behavior (see Supplementary Figures 4 and 5). 




10 20 30 40 50 60 70 80 90 100 
Number of Iterations 

Fig. 4. Trace plot of the average log-likeUhood score and standard deviation 
for 10 Gibbs sampler runs for the Spellman et al. data set. 
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Fig. 3. Entropy measure //fuzzy for data sets with 1000 genes and varying 
number of conditions (subsets of the Gasch et al. data set). 



Analysis of whole genome data sets 

If we run the Gibbs sampler algorithm on the three whole genome 
yeast data sets, we are in the situation where the algorithm very 
rapidly gets stuck in a local maximum. In Figure 4 we plot the ave- 
rage Bayesian log-likelihood score (see eq. (2) in the Methods) for 
10 different Gibbs sampler runs for the Spellman et al. data set. 



Two-way clustering versus one-way clustering 

Our coclustering algorithm extends the CRC algorithm of Qin 
(2006) by also clustering the conditions for each cluster of genes 
{ 'two-way clustering'), instead of assuming they are always inde- 
pendent {'one-way clustering'). We compare the clustering of genes 
for the three yeast data sets using both methods, by computing the 
average number of clusters inferred (A'), the average log-likelihood 
score and the average GO mutual information score for 10 inde- 
pendent runs of each algorithm. The results are tabulated in Table 
1 and 2. For all three data sets, both the log-likelihood score and 
the GO mutual information score are higher (better) for our method. 
The increase in GO mutual information score is especially signifi- 
cant in case of the Hughes et al. data set. This data set has very few 
overexpressed or repressed values and if each condition is conside- 
red independent, there are very few distinct profiles which results 
in the formation of very few clusters (~ 15 for 6052 genes). Also 
clustering the conditions gives more meaningful results since diffe- 
rentially expressed conditions form separate clusters from one large 
background cluster of non-differentially expressed conditions. 

For simulated data sets, clusters are defined as sets of genes sha- 
ring the same regulators in the synthetic regulatory network, and 
the true number of clusters is known. Here we consider a gene net- 
work whose topology is subsampled from an E. coli transcriptional 
network (Van den Bulcke et al., 2006) with 1000 genes, of which 
105 transcription factors, and 286 clusters. For two-way clustering, 
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Table 1. One-way clustering, averages for 10 different Gibbs sampler runs. 



Data set 


Avg. K 


Avg. log-likelihood score 


Avg. MI 


Gasch et al. 
Hughes et al 
Spellman et al. 


52.9(2.6) 
14.9(0.5) 
49.7(2.2) 


-6.101(0.014) X 10"^ 
2.530(0.002) X lO'* 
-7.183(0.037) X 10* 


1.771(0.031) 
0.588(0.044) 
1.491(0.032) 


Table 2. Two-way clustering. 


averages for 10 different Gibbs sampler runs. 


Data set 


Avg. K 


Avg. log-likelihood score 


Avg. MI 


Gasch et al. 
Hughes et al 
Spellman et al. 


84.5(2.5) 
85.5(2.7) 
65.4(4.2) 


-5.586(0.012) X 10^ 
2.798(0.004) X 10** 
-5.112(0.011) X 10"* 


1.912(0.033) 
1.511(0.045) 
1.612(0.032) 



a.s we increase the number of conditions in the simulated data set, 
more clusters are formed and the number of clusters saturates close 
to the true number (see Figure 5). For one-way clustering, addition 
of conditions does not affect the inferred number of clusters which 
is an order of magnitude smaller than the true number (see Figure 
5). For two-way clustering, due to the clustering of conditions, the 
number of model parameters is reduced, and greater statistical accu- 
racy can be achieved, even when the number of genes in a cluster 
becomes small. 

The correlation measure p between true clusters and inferred clu- 
sters also shows a higher value for two-way clustering over one-way 
(Supplementary Figure 8). 

Unlike for simulated data sets, the inferred number of clusters 
does not depend much upon the number of conditions for real 
biological data sets (Supplementary Figure 7), i.e., even if more 
conditions are added, the algorithm does not generate more clusters. 
This is because in simulated data, every addition of a condition adds 
new information, but for real data sets that might not be the case. In 
order to get the true clusters from the expression data, we do not only 
need more conditions but also that each new condition contributes 
information different from the information already available from 
the previous conditions. This might be a reason why the algorithm 
clusters 6052 genes in only ~ 80 clusters (see Table 2). 

Fuzzy clusters 

Our algorithm returns a summary of the posterior distribution in the 
form of a gene-gene matrix whose entries are the probabilities that a 
pair of genes is clustered together. To convert these pairwise proba- 
bilities back to clusters we use a graph spectral method as explained 
in the Methods. The method produces fuzzy overlapping clusters 
where each gene i belongs to each fuzzy cluster k with a probabi- 
lity Pik, such that ^f.Pik ~ 1- The size of a fuzzy cluster k is 
defined as "^^Pik- The algorithm iteratively produces new fuzzy 
clusters until all the information in the pairwise matrix is conver- 
ted into clusters (l*"' stopping criterium, see Methods), or until the 
mathematical conditions underlying the algorithm cease to hold (2"^" 
stopping criterium, see Methods). We applied the algorithm to pair- 
wise probability matrices for each of the three data sets, obtained 
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Fig. 5. Number of gene clusters for a simulated data set with 1000 genes 
and a varying number of conditions, for two-way clustering (top data points 
(x)) and one-way clustering (bottom data points (+)) 



by averaging over 10 different Gibbs sampler runs. For the Gasch 
et al. and Hughes et al. data sets, full fuzzy clustering is achieved 
with 500 fuzzy clusters (all 6052 genes have total assignment pro- 
bability pik > 0.98). For the Spellman et al. data set the second 
stopping criterium is met after producing 321 fuzzy clusters. 

In general, we observe that the algorithm first produces one very 
large fuzzy cluster corresponding to an average expression profile 
that almost all genes can relate to. This cluster is of no interest for 
further analysis. Then it produces a number of fuzzy clusters of 
varying size which show interesting coexpression profiles and are 
useful for further analysis. For the three data sets considered here, 
this number is around 100, consistent with the average number of 
clusters in different Gibbs sampler runs (see Table 2). The remai- 
ning fuzzy clusters are typically very small and consist mostly of 
noise. Like the very first cluster, they are of no interest for further 
analysis. 

Since every gene belongs to every cluster, we use a probability 
cutoff to remove from each cluster the genes which belong to it 
with a very small probability. The smaller the cutoff, the more genes 
belong to a cluster, which results into more fuzzy clusters and vice 
versa. Table 3 shows the total number of genes assigned to at least 
one fuzzy cluster with different cutoff values and in brackets the 
number of genes assigned to at least two fuzzy clusters. 

The goal of merging different Gibbs sampler solutions and for- 
ming fuzzy clusters is to extract additional information out of a data 
set that is not captured by a single hard clustering solution. This can 
be achieved in two ways. First, by obtaining tight clusters of few but 
highly coexpressed genes with a high probability cutoff. Second, 
by characterizing genes which belong to multiple clusters with a 
significant probability. 

For all three data sets, at a probability cutoff of 0.5, we get a sub- 
set of genes which belong to only one cluster with high probability. 
Table 3 shows that each data set retains at least 20% of its genes. 
These are sets of strongly coexpressed genes which cluster together 
in almost every hard cluster solution. Ribosomal genes show such 
a strong coexpression pattern in all the three data sets where most 
genes belong to this cluster with a probability close to 1 (see Figure 
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Table 3. Number of genes clustered and number of genes 
belonging to multiple clusters with different membership 
probability cutoff values. 



Data set 


0.1 


0.3 


0.5 


Gasch et al. 


6045 (4356) 


4062 (344) 


1781 (0) 


Hughes et al. 


6052 (4554) 


3959 (34) 


2254 (0) 


Spellman et al. 


6052 (5187) 


3158 (139) 


1255 (0) 



6). At least 75% of all the genes in cluster 2 (Gasch et al. data), 
cluster 3 (Hughes et al. data) and cluster 2 (Spellman et al. data) are 
located in ribosome. 



Fig. 6. Ribosomal genes form a tight cluster in the Hughes et al. data set. 
(Due to space constraints only the first few genes are shown; for the complete 
figure, see the Supplementary Information.) 



ERP2, RET2, RET3, SEC13, SEC21, SEC24 and others. Cluster 34 
contains genes repressed under nitrogen stress and stationary state. 
20 percent of the genes in cluster 27 also belong to cluster 34 with 
a significant membership. These include genes encoding for ER 
vesicle coat proteins like RET2, RET3, SEC13 and others which 
are induced under DTT stress as well as repressed under nitrogen 
stress and stationary state. Also RIOl, an essential serine kinase, 
belongs to two clusters with a significant probability. It clusters with 
genes involved in ribosomal biogenesis and assembly (Gasch et al. 
data cluster 3) as well as with genes functioning as generators of 
precursor metabolites and energy (Gasch et al. data cluster 7). We 
find similar observations for the Hughes et al. and Spellman et al. 
datasets. Genes CLNl, CLN2 and other DNA synthesis genes like 
CLB6 which are known to be regulated by SBF during SI phase 
(Koch et al, 1996) belong to cluster 19 (Spellman et al. data). They 
also belong with significant probability to cluster 4 (Spellman et al. 
data). More than one third of the genes in cluster 4 are predicted to 
be cell cycle regulated genes. 

CONCLUSION 

We have developed an algorithm to simultaneously cluster genes and 
conditions and sample such coclusterings from a Bayesian probabi- 
listic model. For large data sets, the model is supported on multiple 
equivalent local maxima. The average of these local maxima can 
be represented by a matrix of pairwise gene-gene clustering proba- 
bilities and we have introduced a new method for extracting fuzzy, 
overlapping clusters from this matrix. This method is able to extract 
information out of the data set that is not available from a single, 
hard clustering. 



Local but very strong coexpression patterns can also be detected 
by our method. Cluster 15 of the Gasch et al. dataset consists of only 
4 genes clustered together with probability 1 (see Figure 7). These 
four genes, GALl, GAL2, GAL7, and GAL 10, are enzymes in the 
galactose catabolic pathway and respond to different carbon sources 
during steady state. They are strongly upregulated when galactose 
is used as a carbon source (2""* experiment cluster in Figure 7) and 
strongly downregulated with any other sugar as a carbon source (l*"' 
experiment cluster in Figure 7). In every hard cluster solution, these 
4 genes are clustered together along with other genes. By merging 
these hard cluster solutions to form fuzzy clusters, we get a tight but 
more meaningful cluster with only 4 genes. 



Fig. 7. Four genes GALl, GAL2, GAL7 and GAL 10 form a tight cluster 
showing conditional coexpression in the Gasch et al. data set. 



Table 3 shows that many genes belong to two or more clusters 
with a significant probability. For the Gasch et al. data set, we find 
similar observations as in (Gasch and Eisen, 2002). Cluster 27 con- 
tains genes localized in endoplasmic reticulum (ER) and induced 
under dithiothreitol (DTT) stress like FKB2, JEMl, ERD2, ERPl, 
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